ABSTRACT


This  technical  note  presents  a  proposal  to  perform  the 
analysis  of  DNA  sequences  with  an  analogue  optical  computer.  The 
DNA  analysis  involves  the  computation  of  a  massive  amount  of 
correlations.  A  time-integrating  correlator  is  an  ideal  tool  to 
perform  that  processing  at  a  very  fast  speed.  A  design  based  on 
commercially  available  equipment  is  presented  together  with  a 
comparison  of  the  processing  time  of  the  system  with  conventional 
computer  technology.  The  speed  of  this  design  is  orders  of 
magnitude  greater  than  existing  techniques.  An  overview  of  the 
technology  already  available  for  such  a  project  is  presented 
together  with  an  outline  of  the  areas  that  need  more  development. 


RESUME 

This technical note proposes a system for the analysis of DNA
sequences with an analogue optical computer. The analysis of
DNA sequences involves the computation of an enormous number of
correlations. A time-integrating correlator is an ideal tool to
perform this type of operation rapidly. A system designed from
commercially available equipment is presented together with a
comparison of its processing times with those of conventional
computers.

The improvement in processing speed is several orders of
magnitude and is sufficient to make the processing of the entire
human genome conceivable. Finally, an overview of the technology
already available for the construction of the system is presented
together with an outline of the areas that still need
development.
EXECUTIVE  SUMMARY


Molecular  biologists  tell  us  that  each  cell  in  our  body 
carries  all  the  information  necessary  to  reconstruct  the  entire 
organism.  This  information  is  stored  in  a  molecular  structure 
called  DNA  and  the  analysis  of  DNA  sequences  is  of  particular 
interest  for  the  understanding  of  the  basic  processes  governing 
life.  In  that  context,  the  mission  of  the  Human  Genome  Project 
is  to  map  the  entire  mosaic  of  the  human  DNA.  In  an  effort  to 
reach  that  objective,  biochemists  try  to  match  a  particular 
segment  of  DNA  to  existing  data  banks,  with  the  possibility  that 
the  match  will  not  be  perfect.  Correlation  techniques  implemented 
on  digital  computers  are  used  to  perform  the  analysis  on  the 
limited amount of data available today and the process is
tedious. Considering that only a small fraction of the 3x10^9
human genome nucleotides is now available in the data banks, a
mapping of the entire human genome requires a computational
breakthrough.

This  technical  note  proposes  a  new  method  to  perform  the 
analysis  of  human  or  animal  DNA  sequences  with  an  analogue 
optical  computer.  The  new  method  is  characterized  by  short 
processing  times  that  make  the  analysis  of  the  entire  human 
genome  a  tractable  enterprise.  The  proposal  is  based  on  the 
utilization of a Time-Integrating Correlator (TIC). This type of
optical  correlator  is  particularly  well  suited  to  the  very  fast 
correlation  of  long  data  streams  such  as  the  data  involved  in  the 
analysis  of  DNA.  A  design  based  on  commercially  available 
equipment  is  presented  together  with  a  comparison  of  the 
processing  time  of  the  system  with  conventional  computer 
technology. Comparison of the expected processing times of a
TIC, for a particular case, leads to the conclusion that the TIC
could be 10 times faster than an 80 Mega Instructions Per Second
(MIPS) computer and over 375 times faster than a personal
computer. An overview of the technology already available for
such a project and an outline of the areas that need further
development are also included.
Figure 1: Time-integrating correlator: Mach-Zehnder
architecture. The beam splitter separates the incident laser
beam into two paths. M1 and M2 are folding mirrors. The two
beams diffracted by the Bragg cells are mixed together by a beam
mixer. The two diffracted light distributions are coaxial and
imaged in such a way as to be counterpropagating on the detector
array that performs a time-integration.

Figure 2: Bragg cell operation. The electrical input is applied
to the piezoelectric transducer that generates a moving grating
of changing indices of refraction. That moving grating diffracts
some of the light illuminating the Bragg cells and the
information contained in the electrical input is transferred to
the diffracted laser beam.

Figure 3: Typical output from a TIC: (1) correlation peak formed
by the AxB term and (2) pedestal formed by the A^2 + B^2 terms.

Figure 4: Short representations of the DNA bases where each base
is represented by a 7-bit-long pseudorandom sequence.

Figure 5: The flow of data in a DNA analysis system based on an
optical TIC. On the left side the human genome has a potential
of 3 billion bases. The 50 million bases that are known are
stored in a digital database where they are designated by
letters. These letters are then represented by pseudorandom
binary sequences and transformed into analogue signals which are
suitable to operate a Bragg cell. The right side represents the
new data (query sequence) acquired by a scientist. It undergoes
the same transformation and is correlated by the TIC with the
data from the database on the left side. The results are
displayed and if the query sequence was not already included in
the known DNA database, it is incorporated.
Figure 6: Coarse analysis of a DNA sequence. A database is
illustrated as it propagates through Bragg cell A just before
the passage of the segment that is identical to the query
sequence. The signal formed by the repetitions of the query
sequence is illustrated at the same moment in Bragg cell B. The
correlation peak will start formation a few moments later, in
about the transit time in the Bragg cell divided by two.

Figure 7: Fine, base-by-base analysis of a DNA sequence. The
database and the query sequence are represented by long
pseudorandom sequences that almost fill the Bragg cells'
apertures. The system is illustrated at the moment when the
base G is correlating.

Figure 8: Processing time for the analysis of a 50x10^6 bases
database as a function of the number of bases in the query
sequence. The left of the figure uses log-log axes and covers
query sequences of length 12 to 857. Semi-log axes are more
convenient for the right of the figure because the analysis time
varies linearly with the length of the query sequence. The
abscissa and ordinate are respectively drawn on a logarithmic
scale and a linear scale.
1.0  INTRODUCTION


Optical  information  processing  has  been  developing  since  the 
early 1960's, first slowly, then at an accelerated pace. It is
now  a  field  of  intense  activity.  Although  the  situation  is 
likely  to  change  quickly  from  now  on,  analogue  optical  computing 
has  been  the  area  of  optical  processing  that  has  had  the  most 
success  in  terms  of  the  development  of  practical  systems. 
Acousto-optic  spectrum  analyzers  for  the  processing  of  wideband 
military  radar  signals  and  synthetic  aperture  radar  correlators 
are  probably  among  the  best  examples  of  analogue  optical 
processors  dedicated  to  specific  applications.  However,  the 
achievement  of  a  truly  significant  breakthrough  in  optical 
computing  has  been  elusive  [1]. 

In this paper, we are proposing the application of a late
1970's concept to the solution of a 1990's problem. The problem
considered  here  is  the  analysis  of  human  or  animal  DNA  sequences 
where  biochemists  attempt  to  match  a  query  sequence  of  DNA  to  an 
identical  or  a  similar  segment  that  may  be  present  in  the 
existing  computer  databases.  The  genome  of  a  particular  living 
organism  is  all  its  genetic  information  that  is  encoded  in  DNA 
sequences.  In  this  context,  the  mission  of  the  Human  Genome 
Project  [2-4]  is  to  map  and  sequence  the  entire  mosaic  of  the 
human  DNA.  Correlation  techniques  implemented  on  digital 
computers  are  used  to  do  the  sequence  matching  on  the  limited 
amount of data available today and the process is tedious.
Considering that only a small fraction of the 3x10^9 human genome
nucleotides  is  now  available  and  stored  in  the  data  banks,  a 
computational  breakthrough  is  required  to  allow  the  processing  of 
the  entire  human  genome. 

The  solution  that  we  are  proposing  for  the  analysis  of  DNA 
sequences  is  to  use  a  Time-Integrating  Correlator  (TIC)  whose 
theory and architecture were studied in the late 1970's and the
1980's. This type of optical correlator is particularly well
suited  to  the  very  fast  correlation  of  long  data  streams  such  as 
the  data  involved  in  the  analysis  of  DNA.  The  limitations  on 
dynamic  range  that  are  a  problem  in  the  application  of  analogue 
computing  to  noisy  radar  or  communication  signal  processing,  are 
not  a  problem  here.  The  data  to  be  correlated  comes  from  a 
computer  database  and  is  noiseless. 

DNA  sequences  are  built  from  four  bases  represented  by  the 
letters A, C, G and T. A fifth letter, N, is used to represent
unknown elements at particular locations in a sequence. Optical
approaches have already been considered for the analysis of DNA
sequences. Vander Lugt space integrating correlators have been
proposed  [5,6].  In  this  approach,  the  DNA  bases  and  the  sequence 
they  form  are  represented  by  two  dimensional  arrays  on  which 
pattern  recognition  is  performed  with  a  Vander  Lugt  space 
integrating  correlator.  The  advantages  of  using  the  TIC  approach 
are  that  there  is  no  need  to  transform  the  one-dimensional  DNA 
sequences  into  two-dimensional  patterns  and  the  complex  process 
of  generating  and  changing  matched  filters  is  eliminated. 


In  an  operational  system,  it  is  proposed  that  the  TIC  would 
be  an  optical  black  box  performing  rapid  correlations  under  the 
control  of  a  computer.  The  TIC  would  be  a  high  performance 
correlation  module  integrated  to  a  software  environment  already 
familiar  to  the  users.  The  crucial  difference  would  be  the 
increased  speed  of  operation  of  the  system.  In  this  respect,  the 
proposed  system  meets  the  concerns  for  gradual  insertion  of 
optical technology [7] into information processing systems. This
is viewed as necessary for the progress of optical computing in
the 1990's and beyond.


2.0  SCOPE  OF  THE  PROBLEM  AND  CURRENT  TECHNOLOGY 

All living organisms encode their genetic information in the
same way, by using linear polymers of phosphoric acid and sugar
(deoxyribose) upon which are attached four different bases,
adenine (A), cytosine (C), guanine (G) and thymine (T). These
linear polymers of very long extent (one chromosome of 4x10^6
units for a typical bacterium, twenty-three chromosomes of up to
200x10^6 units for human beings) contain regions called genes,
which are translated into proteins, as well as regulatory regions
and regions of as yet unknown function. These linear polymers
can be read sequentially by chemical and enzymatic techniques and
the resulting linear information interpreted as to its
function. Of particular interest in the human genome are regions
responsible for genetic defects; once these are located and
identified, early treatment may become possible.

Over the past ten years, DNA sequencing techniques have
advanced sufficiently for a modest start to be made on harvesting
and analyzing the formidable array of genetic diversity in life
forms. Most of the DNA sequence information available today is
tabulated in the GenBank* database. Release 65 (September 1990)
of this database contains 49x10^6 nucleotides from all organisms,
divided into thirteen divisions. The Primate division, where
human sequence data is located, accounts for 8x10^6 nucleotides.
While the amount of information harvested so far is very small
compared even to a single human genome, it is already apparent
that present day computer technology will be unable to deal with
future developments. A complete search of GenBank Release 65
with a query sequence of 3000 bases currently takes about 8
minutes of CPU time on an 80 MIPS mainframe computer or over five
hours on a personal computer operating at 25 MHz. As the
database grows towards its projected size of 3x10^9 for the human
genome alone (discounting inevitable overlaps, repetitions and
person-to-person variations), it can be foreseen that current
equipment will quickly become utterly impractical to use.
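
As a rough illustration of this scaling problem, the sketch below extrapolates the quoted search times to a database the size of the full human genome. It assumes, purely for illustration, that search time grows linearly with database size; that linear model and the function name are assumptions and are not part of the quoted measurements.

```python
# Rough extrapolation of the quoted GenBank Release 65 search times.
# ASSUMPTION (not from the measurements): time scales linearly with database size.

GENBANK_R65_BASES = 49e6     # nucleotides in GenBank Release 65
HUMAN_GENOME_BASES = 3e9     # projected size for the human genome alone


def extrapolate_minutes(measured_minutes, measured_bases, target_bases):
    """Scale a measured search time linearly to a larger database."""
    return measured_minutes * target_bases / measured_bases


# 3000-base query: about 8 minutes on the mainframe, over 5 hours on the PC.
mainframe_min = extrapolate_minutes(8.0, GENBANK_R65_BASES, HUMAN_GENOME_BASES)
pc_min = extrapolate_minutes(5 * 60.0, GENBANK_R65_BASES, HUMAN_GENOME_BASES)

print(f"80 MIPS mainframe: about {mainframe_min / 60:.1f} hours per query")
print(f"25 MHz personal computer: about {pc_min / (60 * 24):.1f} days per query")
```

Under that assumption a single 3000-base query would take roughly 8 hours on the mainframe and more than 12 days on the personal computer.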


* Produced by GenBank, c/o IntelliGenetics Inc., 700 East El
Camino Real, Mountain View, CA 94040.


3.0  TIME-INTEGRATING  CORRELATORS 


Time-Integrating Correlators are analogue optical computers
designed to perform the correlation of two signals. The many
possible ways to build TICs are well documented in the literature
[8-27]  and  good  review  papers  are  available  [15,21,27].  The 
various  factors  affecting  the  performance  of  these  systems  have 
also  been  extensively  studied  [28-33].  We  present  in  this 
section  a  brief  review  of  the  principle  of  operation  of  TICs  with 
emphasis  on  the  characteristics  and  parameters  that  have  an 
impact  on  the  design  and  operation  of  a  TIC  applied  towards  the 
analysis  of  DNA  sequences. 

We  have  chosen  to  illustrate  the  concept  of  a  TIC  using  the 
Mach-Zehnder architecture (see Figure 1). Other architectures
may  have  distinct  implementation  advantages,  such  as  compactness 
[17-19],  but  it  is  easier  to  explain  the  principle  of  operation 
of  the  TICs  by  using  the  Mach-Zehnder  configuration  as  a  model. 

The  first  operation  performed  by  a  TIC  is  the  transformation 
of  the  electrical  signals  A  and  B  to  be  correlated  into  modulated 
light  beams.  The  two  input  signals  are  applied  to  piezoelectric 
transducers  attached  to  the  Bragg  cell  crystals  and  acoustic 
waves are generated. They propagate in the Bragg cell crystals
(see Figure 2) thus forming a moving grating of changing indices
of refraction. The two Bragg cells are then illuminated by
expanded laser beams and the laser light interacts with the
acoustic waves. Some of the incident light is diffracted through
the acousto-optic interaction. The information contained in the
electrical signals applied to the Bragg cells is thus transferred
to the diffracted laser beams. The relative positions of the
Bragg cells and of the illuminating beams are arranged to produce
two diffracted beams that are mixed together with a half-silvered
mirror. The signals propagating in the Bragg cells are imaged,
with a lens, onto a linear array of light detectors in such a way
as to be counterpropagating. The detector array performs a
coherent addition and a time-integration of the two images.
Detailed analyses of the signal produced by the detector array
are available in the literature [8,18,21,27,30-33].

We will use here simplified expressions that emphasize the
ability of the system to produce the correlation of the two input
signals A(t) and B(t). The distance along the Bragg cells and
their images is described by the variable z, and the origin, z=0,
is defined to be at the centre of the Bragg cells and,
correspondingly, at the centre of their images on the detector
array (see Figure 1). If A(t) and B(t) are the signals applied
to the Bragg cells, T is the integration time of the detector
array and v is the velocity of propagation of the acoustic signal
in the Bragg cells, the signal S(T,z) produced by the detector
array can be described by:

Figure 2: Bragg cell operation. The electrical input is applied
to the piezoelectric transducer that generates a moving grating
of changing indices of refraction. That moving grating diffracts
some of the light illuminating the Bragg cells and the
information contained in the electrical input is transferred to
the diffracted laser beam.

$$ S(T,z) = \int_T \left[ A(t+z/v) + B(t-z/v) \right]^2 \, dt \tag{1} $$

$$ S(T,z) = \int_T A^2(t+z/v)\,dt + \int_T B^2(t-z/v)\,dt + 2\int_T A(t+z/v)\,B(t-z/v)\,dt \tag{2} $$


The first two terms of equation 2 correspond to a pedestal
on which rides the correlation peak formed by the AB term (see
Figure 3). The presence of a peak indicates that the two input
signals A and B are identical and conversely the absence of the
correlation peak indicates that the two inputs are different.
The presence of a correlation peak with a reduced height
indicates that A and B have similarities but are not identical.
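
The behaviour of equation 2 can be reproduced numerically. The sketch below is a discrete-time illustration only: it assumes real-valued, unit-amplitude (+1/-1) sampled signals, one sample of acoustic travel per time step, and it ignores the optical carrier and all detector effects. The function name `tic_output` and every parameter value are hypothetical.

```python
import numpy as np


def tic_output(a, b, n_pixels, n_steps):
    """Discrete form of equations 1 and 2 for sampled, real-valued signals.

    Pixel position z runs over n_pixels values centred on z = 0; one array
    sample corresponds to one time step and to one pixel of travel.
    """
    half = n_pixels // 2
    z = np.arange(n_pixels) - half           # z = 0 at the centre pixel
    out = np.zeros(n_pixels)
    for t in range(half, half + n_steps):    # keeps every index t - z >= 0
        out += (a[t + z] + b[t - z]) ** 2    # [A(t+z/v) + B(t-z/v)]^2
    return out


rng = np.random.default_rng(0)
n_pixels, n_steps = 101, 2000
signal = rng.choice([-1.0, 1.0], size=n_steps + n_pixels)   # test signal
other = rng.choice([-1.0, 1.0], size=n_steps + n_pixels)    # unrelated signal

same = tic_output(signal, signal, n_pixels, n_steps)   # A and B identical
diff = tic_output(signal, other, n_pixels, n_steps)    # A and B different

print("identical inputs: peak at pixel", int(np.argmax(same)),
      "height", same.max(), "pedestal near", np.median(same))
print("different inputs: maximum", diff.max(),
      "pedestal near", np.median(diff))
```

With unit-amplitude signals the pedestal sits near 2 x n_steps in both cases, while identical inputs add a peak that reaches 4 x n_steps at the centre pixel, where the two counterpropagating images meet.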

One of the most interesting characteristics of the TIC is
that the correlation peak is produced at the meeting point of the
images of the two counterpropagating signals (see Figure 1) on
the detector array. If τ is the transit time of the signals in
the Bragg cells, the width of the time-delay window over which it
is possible to observe the peak is 2τ because the signals are
counterpropagating on the detector array. If the signal duration
is longer than the time-delay window, and if the difference of
time of arrival of the signals is such that the meeting point is
outside the time-delay window, then no correlation peak will be
observed. It will then be necessary to try different time-shifts
of one of the signals to move the correlation peak into the
time-delay window of the TIC. The time-shifts should be designed
to produce contiguous or slightly overlapping time-windows. If
bits are defined as the 0's and 1's of the signals applied to the
Bragg cells, the number of time-delay steps (measured in number
of bits) that can be processed within one time-delay window of
the TIC is given by the duration of the time-delay window 2τ
multiplied by the bit rate of the input signals, D,

$$ 2\tau \times D. \tag{3} $$
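
As a worked illustration of equation 3, the sketch below computes the number of one-bit delay steps covered by one time-delay window and the number of contiguous time-shifts needed to cover a larger range of relative delays. The transit time, bit rate and delay range used here are illustrative assumptions, and the function names are hypothetical.

```python
import math


def bits_per_window(transit_time_s, bit_rate_hz):
    """Equation 3: number of one-bit delay steps within a window of width 2*tau."""
    return 2.0 * transit_time_s * bit_rate_hz


def shifts_needed(delay_range_bits, transit_time_s, bit_rate_hz):
    """Number of contiguous time-delay windows needed to cover a delay range."""
    return math.ceil(delay_range_bits / bits_per_window(transit_time_s, bit_rate_hz))


# Illustrative values only (assumptions):
tau = 100e-6       # Bragg cell transit time, in seconds
bit_rate = 10e6    # bit rate D of the input signals, in bits per second

print("delay steps per window:", bits_per_window(tau, bit_rate))
print("windows needed for a 5000-bit delay range:",
      shifts_needed(5000, tau, bit_rate))
```

With these assumed values one window covers about 2000 bit positions, so a 5000-bit range of relative delays would require three contiguous time-shifts.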


It is also desirable to emphasize the correlation peak by
removing the pedestal shown in Figure 3. The technique used here
involves subtraction of two successive frames collected by the
detector with a 180° phase shift on one of the signals applied to
the Bragg cells for the collection of the second frame [8,21].
The effective integration time generated by the phase-shift
method is twice the integration time of the detector array.
Another technique uses a reference frame that has a pedestal but
does not contain a peak. It is regularly updated and subtracted
from every new frame that is collected. This pedestal removal
technique produces effective integration times that are equal to
the integration time of the detector array but it has to be used
in conjunction with a frame that always contains a positive peak.
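
The phase-shift method can be checked numerically with the same kind of discrete model as equation 2. The sketch below is an illustration under stated assumptions: real-valued +1/-1 sampled signals, a 180° phase shift modelled as a sign inversion of one input, and two successive integration periods of equal length. The function name `frame` and all parameter values are hypothetical.

```python
import numpy as np


def frame(a, b, n_pixels, t_start, n_steps):
    """One detector frame: sum over one period of [A(t+z) + B(t-z)]^2 per pixel."""
    half = n_pixels // 2
    z = np.arange(n_pixels) - half            # z = 0 at the centre pixel
    out = np.zeros(n_pixels)
    for t in range(t_start, t_start + n_steps):
        out += (a[t + z] + b[t - z]) ** 2
    return out


rng = np.random.default_rng(1)
n_pixels, n_steps = 101, 2000
half = n_pixels // 2
s = rng.choice([-1.0, 1.0], size=2 * n_steps + n_pixels)   # identical inputs

first = frame(s, s, n_pixels, half, n_steps)               # first period
second = frame(s, -s, n_pixels, half + n_steps, n_steps)   # second period, B inverted
difference = first - second                                # pedestal cancels

print("peak pixel:", int(np.argmax(difference)))
print("peak height:", difference[half], "(4 x n_steps =", 4 * n_steps, ")")
```

The A^2 and B^2 terms are the same in both frames and cancel in the subtraction, while the AB terms add, so the peak corresponds to the cross term integrated over two periods, in line with the doubled effective integration time.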
Figure 3: Typical output from a TIC: (1) correlation peak formed
by the AxB term and (2) pedestal formed by the A^2 + B^2 terms.

Most aspects of the operation of the TIC are controlled by a
computer  that  we  will  call  the  Controller.  The  Controller 
determines  the  input  of  the  data  to  the  TIC  and  the  collection  of 
data  from  the  detector  array,  including  pedestal  removal  and  peak 
detection. 


4.0  REPRESENTATIONS  OF  THE  DNA  BASES 

DNA sequences are built from four bases represented by the
letters A, C, G and T. A fifth letter, N, is used to represent
unknown elements at particular locations in a sequence. The
sequences representing segments of the human genome have to be
transformed into electrical signals suitable as inputs to the
Bragg cells (see Figure 2). One way to accomplish this is to
represent each base by a binary pseudorandom sequence of the type
used in spread spectrum code division multiple access
communications [34, chap. 3]. The bits (0's and 1's) specified by
the representations of the bases can be implemented using binary
phase-shift-keyed modulation [34, pp. 16-18]. The short
representations listed in Table 1 and Figure 4, which have been
selected for the low value of their cross-correlation, could be
used.


Table 1: Short representations of the DNA bases where
each base is represented by a 7-bit-long
pseudorandom sequence


Adenine (A)     0 0 0 0 0 0 1
Cytosine (C)    0 1 0 0 1 1 0
Guanine (G)     1 0 1 0 0 1 0
Thymine (T)     1 1 0 1 0 0 0
Unknown (N)     1 1 1 0 1 0 1
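
The encoding of Table 1 can be sketched in a few lines. The example below maps each base to its 7-bit code, converts the bits to +1/-1 values as a simple stand-in for the two phases of binary phase-shift keying, and tabulates the zero-lag correlation between every pair of codes. The names `CODES`, `encode` and `correlation_table`, and the choice of mapping bit 0 to -1, are assumptions made for illustration.

```python
import numpy as np

# 7-bit representations taken from Table 1.
CODES = {
    "A": "0000001",
    "C": "0100110",
    "G": "1010010",
    "T": "1101000",
    "N": "1110101",
}


def encode(sequence):
    """Map a DNA string to a +/-1 array (bit 0 -> -1, bit 1 -> +1)."""
    bits = "".join(CODES[base] for base in sequence.upper())
    return np.array([1.0 if bit == "1" else -1.0 for bit in bits])


def correlation_table():
    """Zero-lag correlation between every pair of base representations."""
    bases = list(CODES)
    return {(x, y): int(np.dot(encode(x), encode(y)))
            for x in bases for y in bases}


table = correlation_table()
for (x, y), value in table.items():
    if x <= y:
        print(x, y, value)
```

In this +1/-1 form each code correlates with itself at 7 and with every other code at -1, which illustrates the low cross-correlation for which the representations were selected.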


Figure  5  represents  the  flow  of  data  in  a  DNA  analysis 
system  based  on  an  optical  TIC.  On  the  left  side,  the  human 
genome  data  base  has  a  potential  of  3  billion  bases.  Currently 
there  are  approximately  50  million  bases  of  sequence  available 
from  all  living  organisms.  The  50  million  bases  that  are  known 
are  stored  in  a  digital  database  where  they  are  designated  by 
letters.  These  letters  are  then  represented  by  pseudorandom 
binary  sequences  and  transformed  into  analogue  signals  which  are 
suitable  to  operate  a  Bragg  cell.  The  right  side  represents  the 
new  query  sequence  acquired  by  a  scientist.  It  undergoes  the 
same  transformation  and  is  correlated  with  the  database  from  the 
left  side  by  the  TIC.  The  results  are  displayed  and  if  the  query 
sequence  was  not  already  included  in  the  known  DNA  database,  it 
is  incorporated  into  the  database. 
Figure 4: Short representations of the DNA bases where each
base is represented by a 7-bit-long pseudorandom sequence.

Figure 5: The flow of data in a DNA analysis system based on
an optical TIC. On the left side the human genome has a
potential of 3 billion bases. The 50 million bases that are
known are stored in a digital database where they are designated
by letters. These letters are then represented by pseudorandom
binary sequences and transformed into analogue signals which are
suitable to operate a Bragg cell. The right side represents the
new data (query sequence) acquired by a scientist. It undergoes
the same transformation and is correlated by the TIC with the
data from the database on the left side. The results are
displayed and if the query sequence was not already included in
the known DNA database, it is incorporated.

Figure 6: Coarse analysis of a DNA sequence. A database is
illustrated as it propagates through Bragg cell A just before
the passage of the segment that is identical to the query
sequence. The signal formed by the repetitions of the query
sequence is illustrated at the same moment in Bragg cell B. The
correlation peak will start formation a few moments later, in
about the transit time in the Bragg cell divided by two.

Figure 7: Fine, base-by-base analysis of a DNA sequence. The
database and the query sequence are represented by long
pseudorandom sequences that almost fill the Bragg cells'
apertures. The system is illustrated at the moment when the
base G is correlating.

5.0  DNA  ANALYSIS  STRATEGY


The  purpose  of  this  section  is  to  present  a  strategy  to 
implement  the  analysis  of  a  DNA  sequence  with  a  TIC.  We  wish  to 
find  segments  of  the  database  that  are  identical  or  similar  to 
the  query  sequence  and  their  location  within  the  database.  We 
also  want  to  produce  a  base-by-base  comparison  of  the  query 
sequence  using  the  segments  of  the  database  that  are  identified 
as  correlating  with  the  query  sequence.  The  analysis  is  made 
using  a  two-level  procedure.  A  coarse  analysis  is  first  used  to 
locate  the  areas  of  the  database  that  are  similar  or  identical  to 
the query sequence (see Figure 6). Then, a fine analysis (see
Figure 7) is performed on the database segments identified by
the coarse analysis to establish the map of conformity.
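
The second level of the procedure can be pictured with a small digital emulation. The sketch below assumes that the coarse analysis has already located a database segment of the same length as the query, and it compares the two base by base using the 7-bit codes of Table 1 in +1/-1 form. The function name `conformity_map` and the '|' and '.' markers are hypothetical choices made for illustration, not the output format of the proposed system.

```python
import numpy as np

# 7-bit representations taken from Table 1.
CODES = {"A": "0000001", "C": "0100110", "G": "1010010",
         "T": "1101000", "N": "1110101"}


def encode(sequence):
    """Map a DNA string to a +/-1 array (bit 0 -> -1, bit 1 -> +1)."""
    bits = "".join(CODES[base] for base in sequence.upper())
    return np.array([1.0 if bit == "1" else -1.0 for bit in bits])


def conformity_map(segment, query):
    """Per-base comparison: '|' where the codes correlate fully, '.' elsewhere."""
    if len(segment) != len(query):
        raise ValueError("segment and query must contain the same number of bases")
    seg = encode(segment).reshape(-1, 7)     # one row of 7 bits per base
    qry = encode(query).reshape(-1, 7)
    per_base = (seg * qry).sum(axis=1)       # 7 for a match, lower otherwise
    return "".join("|" if value == 7 else "." for value in per_base)


print("ACGATT")
print(conformity_map("ACGATT", "ACGTTT"))
print("ACGTTT")
```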


6.0  STRATEGY  FOR  COARSE  ANALYSIS 
6.1  Introduction 

The  purpose  of  the  coarse  analysis  is  to  find  the  areas  of 
the  database  that  are  similar  to  the  query  sequence.  The  process 
involved  in  the  production  of  the  correlation  peaks  for  the 
coarse  analysis  consists  of  sending  the  database  sequence  without 
interruption through Bragg cell A (see Figure 6).
Simultaneously,  the  query  sequence  is  passed  through  Bragg  cell  B 
continuously.  The  output  of  the  detector  array  is  examined  at 
regular  intervals  T.  The  pedestal  is  removed  and  the  presence  of 
a  peak  is  verified  by  comparison  with  a  preset  threshold  level 
for  each  collected  frame.  The  setting  of  the  threshold  level 
determines  the  degree  of  similarity  that  is  required  to  declare 
that  a  certain  segment  of  the  database  correlates  with  the  query 
sequence.  The  higher  the  peak,  the  better  the  correlation 
between  the  query  sequence  and  the  database.  These  operations 
can  be  performed  in  real  time  with  a  proper  hardware 
implementation.  When  a  segment  of  the  database  in  Bragg  cell  A 
is  identical  or  sufficiently  similar  to  the  query  sequence  in 
Bragg  cell  B,  correlation  peaks  will  be  produced  and  detected. 

The  time  of  occurrence  of  such  events  is  associated  with  the 
position  of  the  query  sequence  in  the  database  and  can  be 
determined  by  knowing  which  frame  contains  the  correlation.  All 
the  occurrences  of  a  correlation  peak  will  be  noted  and  the  fine 
analysis  will  follow  to  obtain  a  base-by-base  comparison  of  the 
query  sequence  with  the  database. 
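
The coarse analysis described above can be emulated digitally to make the role of the threshold concrete. The sketch below slides the encoded query along the encoded database, normalises the correlation so that 1.0 means an identical segment, and reports the base positions whose peak reaches the threshold. It is only a software stand-in for the optical process: it tests base-aligned offsets only, the names `coarse_search` and `threshold` are hypothetical, and the short sequences are made up for the example.

```python
import numpy as np

# 7-bit representations taken from Table 1.
CODES = {"A": "0000001", "C": "0100110", "G": "1010010",
         "T": "1101000", "N": "1110101"}


def encode(sequence):
    """Map a DNA string to a +/-1 array (bit 0 -> -1, bit 1 -> +1)."""
    bits = "".join(CODES[base] for base in sequence.upper())
    return np.array([1.0 if bit == "1" else -1.0 for bit in bits])


def coarse_search(database, query, threshold=0.8):
    """Return (base_position, normalised_peak) for segments reaching the threshold."""
    db, q = encode(database), encode(query)
    corr = np.correlate(db, q, mode="valid") / len(q)   # 1.0 means identical
    hits = []
    for offset in range(0, len(corr), 7):               # base-aligned offsets only
        if corr[offset] >= threshold:
            hits.append((offset // 7, float(corr[offset])))
    return hits


database = "ACGTTGCAACGTNNACGATT"    # made-up example data
print(coarse_search(database, "ACGT", threshold=0.8))   # identical segments only
print(coarse_search(database, "ACGT", threshold=0.7))   # also 3-of-4 base matches
```

Lowering the threshold admits segments that differ from the query in some bases, which mirrors the statement that the threshold setting determines the degree of similarity required to declare a correlation.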

The length of the query sequences used can be very
different. A system that would handle query sequences containing
between 12 and 10[exponent illegible] bases would be considered a
valuable research tool by biochemists. Although only
approximately 50x10^6 bases of DNA sequence are presently
identified from all living species, the system should be able to
handle the full human genome of 3x10^9 bases. Our design of DNA
analysis with an optical TIC is based on these numbers.




6.2  Query  Sequence  Duration  Issues 


The appearance of the correlation peak is usually not
synchronized with the beginning of the integration periods. For
example, a peak could start formation halfway through an
integration period. To ensure that at least one integration
period produces a peak of maximum height, the duration of the
query sequence d_q must be at least twice the detector integration
time T. If we assume that the pedestal removal is done by
subtracting a reference frame, we must then have

$$ d_q > 2T \tag{4} $$

If the phase shift pedestal removal technique (see section 3.0)
was used, the requirement would be d_q > 4T because the
correlation function is obtained from the subtraction of two
successive frames with an effective integration time of 2T.

When a query sequence duration is short, it has to be
stretched to meet this criterion. One approach is to use longer
representations of the DNA bases. Another approach is to repeat
each bit enough times to extend sufficiently the query sequence
duration. If representations with bit repetition are chosen for
the bases of the query sequence, the same representations have
naturally to be used for the bases of the database. Stretching
the query sequence duration can also be achieved by reducing the
bit rate. Whatever stretching method is used, longer analysis
times will result.

It is desired that the query sequence duration d_q be greater
than 2T, to ensure the formation of at least one maximum height
peak. However, d_q should also be less than 2τ, the time-delay
window of the TIC, to avoid having to bring the correlation peak
within the time-delay window of the TIC using different
time-shifts to explore all possible relative delays between the
query sequence and the database.
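
The two constraints on the query sequence duration can be combined into a simple check. The sketch below assumes reference-frame pedestal removal (the 2T bound of equation 4), computes the duration of an encoded query for a given bit rate and bit-repetition factor, and searches for the smallest repetition factor that places the duration between 2T and 2τ. All numerical values and function names are illustrative assumptions.

```python
def query_duration(n_bases, bit_rate_hz, bits_per_base=7, repetition=1):
    """Duration in seconds of a query sequence once encoded and stretched."""
    return n_bases * bits_per_base * repetition / bit_rate_hz


def duration_ok(d_q, T, tau):
    """Check 2T < d_q < 2*tau (reference-frame pedestal removal assumed)."""
    return 2 * T < d_q < 2 * tau


def smallest_repetition(n_bases, bit_rate_hz, T, tau, bits_per_base=7):
    """Smallest bit-repetition factor meeting the criterion, or None if none fits."""
    rep = 1
    while query_duration(n_bases, bit_rate_hz, bits_per_base, rep) < 2 * tau:
        if duration_ok(query_duration(n_bases, bit_rate_hz, bits_per_base, rep), T, tau):
            return rep
        rep += 1
    return None    # already too long for the window at this bit rate


# Illustrative values only (assumptions):
T = 50e-6          # detector integration time, in seconds
tau = 100e-6       # Bragg cell transit time, in seconds
bit_rate = 10e6    # bits per second

print("12-base query:", smallest_repetition(12, bit_rate, T, tau))
print("857-base query:", smallest_repetition(857, bit_rate, T, tau))
```

With these assumed values a 12-base query needs each bit repeated 12 times, whereas an 857-base query is already longer than the time-delay window, so a higher bit rate or several time-shifts would be needed instead.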

6.3  Analysis  Times 

Table 2 lists the analysis times, as a function of the
number of bases in the query sequence, associated with one of the
many possible strategies that combine bit repetition with bit
rate adjustments for the analysis of a 50x10^6 bases database
corresponding to the DNA information available today.
Representations of length 7 bits have been used (see Table 1).

For the sake of discussion, an integration time T of 50 µs
(corresponding to a 1000-element detector array with a 20 MHz
read-out frequency) and a time-delay window, 2t, of 200 µs
(corresponding to a 100 µs aperture Bragg cell) were selected.
These numbers are representative specifications of commercial
equipment currently available. Table 2 was compiled by adjusting
the bit repetition and the bit rate to obtain appropriate query
sequence durations. The query sequence durations used ranged from
100 µs (twice the integration time T) to 200 µs (the time-delay
window of the TIC) for the reasons given in Section 6.2. The
transition from dual to single bit repetition, and to higher
bit rates, is made at the shortest query sequence length possible
to obtain minimum processing times. Only one TIC was assumed to
be available to do the processing. The data in Table 2 are
plotted in Figure 8, where the left side uses log-log axes and
covers query sequences of length 12 to 857. Semi-log axes are
more convenient for the right side because the analysis time
varies linearly with the length of the query sequence. The
abscissa and the ordinate are drawn on a logarithmic and a linear
scale, respectively.

Table 2:  Analysis times for a 50x10^6 bases database as a
          function of the number of bases in the query sequence,
          with 7-bit representations of the bases.

          [Table entries are not legible in the source.]
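
One way to choose the bit repetition and the bit rate can be sketched as a small search. This is an illustration of the reasoning, not the procedure used to compile Table 2: the candidate sets (repetitions of one or two, whole-MHz bit rates from 1 to 30 MHz), the tie-breaking rule and all names are assumptions.

```python
import math

R = 7             # bits per base (7-bit representations)
T = 50e-6         # detector integration time, s
WINDOW = 200e-6   # time-delay window of the TIC, s
N_DB = 50e6       # bases in the database
EPS = 1e-9        # tolerance for floating-point comparisons


def analysis_time(n, r, bit_rate):
    """Return (query duration, time-shifts, total time), or None if too short."""
    d_q = n * R * r / bit_rate
    if d_q < 2 * T * (1 - EPS):
        return None                           # no full-height peak guaranteed
    shifts = math.ceil(d_q / WINDOW - EPS)    # runs needed to cover all delays
    run = N_DB * R * r / bit_rate             # duration of one pass of the database
    return d_q, shifts, shifts * run


def best_parameters(n, repetitions=(1, 2), rates_mhz=range(1, 31)):
    """Return (total time, repetition, bit rate in MHz, time-shifts) with minimum time."""
    best = None
    for r in repetitions:
        for mhz in rates_mhz:
            result = analysis_time(n, r, mhz * 1e6)
            if result is None:
                continue
            candidate = (result[2], r, mhz, result[1])
            if best is None or candidate < best:
                best = candidate
    return best


for n in (12, 20, 429, 3000):
    print(n, best_parameters(n))
```

Under these assumptions a 12-bases query needs a repetition of two at 1 MHz (one 700 second run), while a 3000-bases query runs at 30 MHz with four time-shifts. The unrounded single-run time at 30 MHz is about 11.7 seconds, which the text rounds to 12 seconds.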

6.4  Examples 

Examples drawn from Table 2 are discussed in detail in the
following paragraphs. Let us assume that we do not want to use a
bit rate less than 1 MHz and that the database contains 50x10^6
bases. The query sequence duration d_q is:

     d_q = n R r / D                                     (5)

where n is the number of bases in the query sequence, R is the
length of the representations in bits, r is the repetition of the
bits and D is the bit rate. For an integration time T of 50 µs,
the query sequence duration d_q should therefore be at least
100 µs to ensure at least one frame with a full height
correlation peak.

1)  Let us consider a query sequence that contains 12 bases
    (see Table 2). If a seven-bit-long representation is
    used with a repetition of two and a bit rate of 1 MHz,
    the query sequence has a duration of 168 µs. The
    database has to be generated with the same parameters.
    A database made of 50x10^6 bases will have a duration d_d
    of 700 seconds. The 200 µs time-delay window of the
    TIC is sufficient to see the whole query sequence, so
    only one time-shift has to be tried and one 700 second
    run is required to analyse the data.

2)  It is possible to analyse query sequences that are
    between 12 and 14 bases long with the same parameters
    because a 14-bases long query sequence has a duration
    of 196 µs. However, it is advantageous to reduce the
    analysis time by introducing a lower number of
    repetitions and by increasing the bit rate as the
    number of bases in the query sequence increases.


Figure 8:  Processing time for the analysis of a 50x10^6 bases
           database as a function of the number of bases in
           the query sequence. The left of the figure uses
           log-log axes and covers query sequences of length
           12 to 857. Semi-log axes are more convenient for
           the right of the figure because the analysis time
           varies linearly with the length of the query
           sequence. The abscissa and ordinate are
           respectively drawn on a logarithmic scale and a
           linear scale.


3)  Query sequences containing between 15 and 28 bases can
    be analyzed with a repetition of one and a bit rate of
    1 MHz.

4)  When the query sequences contain more than 429 bases it
    is possible to use the maximum bit rate of 30 MHz. From
    there, the analysis time grows linearly with the length
    of the query sequence and the number of time-shifts
    that have to be used. Longer query sequences produce
    many consecutive correlation peaks. For example, a
    500-bases query sequence, a 3000-bases query sequence
    and a 10^5-bases query sequence have respective
    durations of 116.7 µs, 700 µs and 23333 µs when
    processed at a bit rate of 30 MHz. One, four and 117
    time-shifts have to be tried, thus producing total
    analysis times of 12 seconds, 48 seconds and 22 minutes
    for a database containing 50x10^6 bases. Performing the
    same analysis on all 3x10^9 bases of the human genome
    would take respectively 12 minutes, 48 minutes and 23
    hours.
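
The arithmetic of example 4 can be reproduced with a few lines. This is a sketch with hypothetical function names; it assumes 7-bit representations, no bit repetition, the 30 MHz bit rate and the 200 µs time-delay window quoted in the text.

```python
import math

BITS_PER_BASE = 7
BIT_RATE = 30e6      # Hz
WINDOW = 200e-6      # time-delay window of the TIC, s


def query_duration(n_bases):
    """Duration d_q of the query sequence in seconds."""
    return n_bases * BITS_PER_BASE / BIT_RATE


def time_shifts(n_bases):
    """Number of time-shifted runs needed to cover the whole query."""
    return math.ceil(query_duration(n_bases) / WINDOW)


def total_time(n_query, n_database):
    """Total analysis time in seconds with a single TIC."""
    one_run = n_database * BITS_PER_BASE / BIT_RATE
    return time_shifts(n_query) * one_run


for n in (500, 3000, 100_000):
    print(n,
          round(query_duration(n) * 1e6, 1), "us,",
          time_shifts(n), "shift(s),",
          round(total_time(n, 50e6), 1), "s,",
          round(total_time(n, 3e9) / 60, 1), "min for the genome")
```

The unrounded values are slightly lower than the figures in the text, which appear to round the duration of a single run up to 12 seconds (12 minutes for the genome) before multiplying by the number of time-shifts.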

It is possible at this point to compare, for one particular
case, the analysis times of the TIC described here with current
digital technologies. The analysis of a 3000-bases query
sequence in a 50x10^6 bases database is performed in 48 seconds
with the TIC, in 8 minutes with an 80 MIPS mainframe computer and
in over 5 hours with a personal computer operating at 25 MHz (see
section II). The TIC is then 10 times faster than the 80 MIPS
computer and over 375 times faster than the personal computer.

We assume that only one TIC is available to perform the
analysis, so the various time-shifts have to be tried
consecutively. If more than one TIC is available, parallel
processing of many 200 µs windows is possible and the analysis
time is divided by the number of TICs operating in parallel. If
four TICs are available, the human genome analysis times for the
three examples above become 3 minutes, 12 minutes and 5.8 hours.
Compact architectures [17-19] have been developed for the
construction of TICs, and applying such an approach to the
reduction of processing time is feasible.

The next step in the coarse analysis is to transform the
time of occurrence of the peaks into a location in the database.
A rough estimate of the time of occurrence is provided by the
frame number where the peaks are found. It is possible to make a
more precise determination of the time of occurrence by finding
the location of the peak within the frame. At this point, the
coarse analysis is complete and a fine analysis of the segments
previously identified as interesting has to be performed.
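
The conversion from frame number to database location can be sketched as follows. This is a rough illustration only: it assumes that frames are read out back to back from the start of the database stream, the function names are hypothetical, and the exact mapping of the peak position within a frame to a time offset depends on the geometry of the correlator and is not modelled here.

```python
def bases_per_frame(integration_time, bit_rate, bits_per_base, repetition=1):
    """Number of database bases that stream past during one detector frame."""
    return integration_time * bit_rate / (bits_per_base * repetition)


def coarse_location(frame, integration_time, bit_rate, bits_per_base, repetition=1):
    """Approximate index of the database base at the start of a given frame."""
    elapsed = frame * integration_time
    return int(elapsed * bit_rate / (bits_per_base * repetition))


def bases_per_element(window, n_elements, bit_rate, bits_per_base, repetition=1):
    """Delay resolution of one detector element, expressed in bases."""
    return window / n_elements * bit_rate / (bits_per_base * repetition)


# Values quoted in the text: T = 50 us, 30 MHz, 7 bits per base,
# 200 us window sampled by a 1000-element detector array.
print(bases_per_frame(50e-6, 30e6, 7))
print(coarse_location(1000, 50e-6, 30e6, 7))
print(bases_per_element(200e-6, 1000, 30e6, 7))
```

With these values the frame number alone locates a peak to within roughly 214 bases, and the position within the frame refines this to about one base per detector element.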
7.0  STRATEGY FOR FINE ANALYSIS


The purpose of fine analysis is to produce a base-by-base
comparison between the database and the query sequence, so that
any discrepancies are revealed in detail. The key to fine
analysis is to use lower data rates and much longer
representations of the bases, and to perform the analysis only
on the segments of interest identified by the coarse analysis.
Maximum length pseudorandom sequences containing 127 bits (see
Table 3) and an integration time of 127 µs could be used with a
data rate of 1 MHz.


Table 3:  Long representations of the DNA bases with 127-bit
          maximum length pseudorandom sequences that are
          designated [35, p.62] by their octal and their
          polynomial representations.


               octal            polynomial
               representation   representation

Adenine (A)    203              x^7 + x + 1
Cytosine (C)   211              x^7 + x^3 + 1
Guanine (G)    217              x^7 + x^3 + x^2 + x + 1
Thymine (T)    221              x^7 + x^4 + 1
Unknown (N)    235              x^7 + x^4 + x^3 + x^2 + 1
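
A 127-bit maximum length sequence can be generated from its octal designation with a linear feedback shift register. The sketch below is one possible software realisation, not necessarily the generator intended by the author: the function names, the seed and the bit ordering of the output are assumptions.

```python
def m_sequence(octal_code, seed=1):
    """Generate one period of the sequence defined by an octal polynomial code.

    '203' is binary 10000011, i.e. x^7 + x + 1. The recurrence used is
    a[n+7] = sum of c_i * a[n+i] (mod 2) over the low-order coefficients c_i.
    """
    poly = int(octal_code, 8)
    degree = poly.bit_length() - 1
    coeffs = [(poly >> i) & 1 for i in range(degree)]   # c_0 .. c_(degree-1)
    state = [(seed >> i) & 1 for i in range(degree)]    # a[0] .. a[degree-1]
    out = []
    for _ in range(2 ** degree - 1):
        out.append(state[0])
        feedback = 0
        for c, s in zip(coeffs, state):
            feedback ^= c & s
        state = state[1:] + [feedback]
    return out


def periodic_autocorrelation(bits, shift):
    """Periodic autocorrelation of the sequence mapped to +1/-1 levels."""
    levels = [1 - 2 * b for b in bits]
    n = len(levels)
    return sum(levels[i] * levels[(i + shift) % n] for i in range(n))


for base, code in zip("ACGTN", ("203", "211", "217", "221", "235")):
    seq = m_sequence(code)
    print(base, code, len(seq), sum(seq),
          periodic_autocorrelation(seq, 0), periodic_autocorrelation(seq, 1))
```

Each sequence has a periodic autocorrelation of 127 at zero shift and -1 at every other shift, which is the property that gives a sharp correlation peak for each base.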

When the TIC operates in this mode (see Figure 7), the
bases of the database should be synchronized with the
bases of the query sequence to optimise the height of the
correlation peaks. The controller of the system and the access
to the memory containing the query sequence and the database
should be designed with enough flexibility to provide the
capability to move back and forth in the memory in order to
analyse in detail the gaps and discrepancies between the query
sequence and the database. The time required to do this
analysis is a linear function of the number of bases in the query
sequence. As it takes 127 µs to confirm the presence of a
particular base at a particular location in the database, a
detailed analysis of a 3000-bases and a 10^5-bases query sequence
takes less than 2 seconds and 16 seconds respectively. A 20%
time overhead is added for the determination of the parameters of
gaps and the exact location of the beginning of the query
sequence in the database. If there is more than one occurrence
of the query sequence in the database, the fine analysis has to
be repeated each time. The analysis of the reverse complementary
strand of a particular query sequence should be treated as a new
experiment with a different query sequence, and the coarse and
fine analyses have to be repeated.
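
The fine analysis times follow directly from the figures above. A minimal sketch, assuming that the 20% overhead applies to the whole analysis and using a hypothetical function name:

```python
def fine_analysis_time(n_bases, bits_per_base=127, bit_rate=1e6, overhead=0.20):
    """Time in seconds for a base-by-base analysis of one occurrence."""
    per_base = bits_per_base / bit_rate       # 127 us per base at 1 MHz
    return n_bases * per_base * (1 + overhead)


for n in (3000, 100_000):
    print(n, round(fine_analysis_time(n), 2), "s")
```

Both results fall within the bounds quoted in the text (less than 2 seconds and 16 seconds). The time has to be multiplied by the number of occurrences found by the coarse analysis.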
8.0  HIGH PERFORMANCE OPTICAL PROCESSING FOR DNA ANALYSIS


8.1  Introduction

A system dedicated to the analysis of DNA sequences based on
a TIC conceptually contains three parts: 1) the optical
correlator, 2) the Controller and 3) a fast-access, large
capacity storage unit for the data from the query sequence and
the database. The feasibility of the construction and operation
of the first two parts of the system has already been
demonstrated at the Defence Research Establishment Ottawa.

8.2  Selection  of  the  Components  of  a  TIC  For  DNA  Analysis 

Considering the large amount of data to be processed for DNA
analysis, the components of the TIC and the operating procedures
should be selected to provide maximum speed of operation. The
Bragg cells should have the largest possible transit time t to
maximize the time-delay window 2t in which to observe a
correlation peak between two signals. The bandwidth of the Bragg
cells should also be maximized for maximum bit rate operation.
In order to avoid producing a distorted correlation peak [34,
p.24], the bit rate should not exceed the Bragg cell bandwidth
divided by 1.5. The integration time T of the detector array
should be as short as a reasonable read-out rate allows, while
the array keeps enough elements to provide sufficient
resolution of the time-delay window and an accurate
determination of the peak position.

A design based on commercially available equipment could
include TeO2 Bragg cells with a 100 µs time aperture and a 50 MHz
bandwidth. A maximum data rate of 30 MHz could then be used.
Detector arrays with 1000 elements, a read-out rate of 20 MHz and
a minimum integration time of 50 µs are available, thus generating
20000 correlations per second.
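
The operating parameters follow from the component specifications by simple arithmetic. The sketch below uses a hypothetical function name and dictionary keys; the relations themselves (window equal to twice the aperture, bit rate limited to the bandwidth divided by 1.5, one frame per read-out of the array) are the ones stated in the text.

```python
def tic_parameters(aperture, bandwidth, n_elements, readout_rate):
    """Derive TIC operating parameters from component specifications (SI units)."""
    return {
        "window": 2 * aperture,                      # time-delay window 2t, s
        "max_bit_rate": bandwidth / 1.5,             # upper bound on the bit rate, Hz
        "integration_time": n_elements / readout_rate,   # minimum T, s
        "frames_per_second": readout_rate / n_elements,  # correlations per second
    }


params = tic_parameters(aperture=100e-6, bandwidth=50e6,
                        n_elements=1000, readout_rate=20e6)
for name, value in params.items():
    print(name, value)
```

The bound on the bit rate comes out at about 33 MHz, so the 30 MHz data rate retained in the text sits just below it.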

8.3  The  Controller 

Most aspects of the operation of the TIC are controlled by a
computer that we call the Controller. The Controller should
maintain an interface with the user that allows 1) the selection
of the database to be used for analysis, 2) the input of the
query sequence to be analyzed and 3) the display of the results
in a format familiar to the user. The Controller is also
responsible for the input of the data to the TIC and for the
collection of data from the detector array, including pedestal
removal and peak detection. It should contain algorithms to
define the parameters of operation for the coarse and the fine
analysis and should decide on the base representation length,
the number of repetitions of the bits and the data rate to be used.
It should also determine how many times the analysis has to be
repeated for the various time-shifted versions of the query sequence.

Work already performed at DREO on a similar system for the
processing of communication signals has demonstrated the
feasibility of such a system for real-time operation at a data
rate of 30 MHz.
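
The pedestal removal and peak detection tasks of the Controller can be sketched as follows. This is a minimal illustration of the reference-frame subtraction mentioned in Section 6.2, not the DREO implementation: the function name, the threshold rule and the sample values are assumptions.

```python
def detect_peaks(frame, reference, threshold):
    """Subtract the reference (pedestal) frame and return local maxima.

    Returns a list of (element index, height) for every detector element
    whose pedestal-free value is a local maximum at or above `threshold`.
    """
    corrected = [f - r for f, r in zip(frame, reference)]
    peaks = []
    last = len(corrected) - 1
    for i, value in enumerate(corrected):
        left = corrected[i - 1] if i > 0 else float("-inf")
        right = corrected[i + 1] if i < last else float("-inf")
        if value >= threshold and value >= left and value > right:
            peaks.append((i, value))
    return peaks


# Hypothetical 8-element frames; a real detector array has about 1000 elements.
reference = [10, 10, 10, 10, 10, 10, 10, 10]
frame = [10, 11, 10, 25, 12, 10, 10, 10]
print(detect_peaks(frame, reference, threshold=5))
```

In this example only element 3 stands clear of the pedestal, so a single peak of height 15 is reported.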

8.4  Fast  Access  Storage  for  Data  Input  to  the  TIC 

The  storage  unit  should  store  billions  of  DNA  bases  and 
prepare  the  signal  representations  required  by  the  Bragg  cells. 
A  capability  to  send  the  data  to  the  TIC  at  bit  rates  between  1 
and  30  MHz  should  be  available.  Flexible,  fast  access  to  any 
part  of  the  data  is  particularly  important  for  rapid  fine 
analysis.  Preliminary  consultations  led  to  the  conclusion  that 
this  task  is  not  trivial  but  is  feasible  with  existing 
technology. 


9.0  CONCLUSION 

Elements  of  optical  data  processing  and  spread-spectrum 
communication  theory  have  been  integrated  to  present  a  proposal 
for  the  analysis  of  DNA  sequences  with  an  optical  TIC.  An 
analysis  strategy  including  a  coarse  and  a  fine  analysis  was 
developed  and  the  resulting  processing  times  were  calculated.  It 
was  concluded  that  TICs  could  produce  a  substantial  improvement 
in DNA analysis processing times. A comparison of the expected
processing times of a TIC with those of digital computers, for a
particular case, leads to the conclusion that the TIC could be
10 times faster than an 80 MIPS computer and over 375 times
faster than a personal computer. The requirements of an
operational system were outlined.